3DNow!™ Technology Manual
Trademarks

AMD, the AMD logo, K6, 3DNow!, AMD Athlon, and combinations thereof, and K86 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. MMX is a trademark of Intel Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

© 2000 Advanced Micro Devices, Inc. All rights reserved.

The contents of this document are provided in connection with Advanced Micro Devices, Inc. (AMD) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.
21928G/0 March 2000

Contents

Revision History  ix

1  3DNow!™ Technology  1
     Introduction  1
     Key Functionality  2
     Feature Detection  3
     Register Set  4
     Data Types  6
     3DNow!™ Instruction Formats  8
     Definitions  9
       Execution Resources on AMD-K6® Processors  11
       Task Switching  15
       Exceptions  15
       Prefixes  16

2  3DNow!™ Instruction Set  17
     FEMMS  18
     PAVGUSB  19
     PF2ID  21
     PFACC  23
     PFADD  25
     PFCMPEQ  27
     PFCMPGE  29
     PFCMPGT  31
     PFMAX  33
     PFMIN  35
     PFMUL  37
     PFRCP  39
     PFRCPIT1  41
     PFRCPIT2  43
     PFRSQIT1  45
     PFRSQRT  47
     PFSUB  49
     PFSUBR  51
     PI2FD  53
     PMULHRW  54
     PREFETCH/PREFETCHW  56

3  Division and Square Root  59
     Division  59
       Divide Examples  60
     Square Root  61
       Square Root Examples  61
List of Figures

Figure 1.  3DNow!™/MMX™ Registers  5
Figure 2.  3DNow! Data Type  6
Figure 3.  Single-Precision, Floating-Point Data Format  6
Figure 4.  Integer Data Types  7
Figure 5.  Register X Unit and Register Y Unit Resources  13
List of Tables

Table 1.   3DNow!™ Technology Exponent Ranges  10
Table 2.   3DNow! Floating-Point Instructions  14
Table 3.   3DNow! Performance-Enhancement Instructions  14
Table 4.   3DNow! and MMX™ Instruction Exceptions  15
Table 5.   Numerical Range for the PF2ID Instruction  22
Table 6.   Numerical Range for the PFACC Instruction  24
Table 7.   Numerical Range for the PFADD Instruction  26
Table 8.   Numerical Range for the PFCMPEQ Instruction  28
Table 9.   Numerical Range for the PFCMPGE Instruction  30
Table 10.  Numerical Range for the PFCMPGT Instruction  32
Table 11.  Numerical Range for the PFMAX Instruction  34
Table 12.  Numerical Range for the PFMIN Instruction  36
Table 13.  Numerical Range for the PFMUL Instruction  38
Table 14.  Numerical Range for the PFRCP Instruction  40
Table 15.  Numerical Range for the PFRCPIT1 Instruction  42
Table 16.  Numerical Range for the PFRCPIT2 Instruction  44
Table 17.  Numerical Range for the PFRSQIT1 Instruction  46
Table 18.  Numerical Range for the PFRSQRT Instruction  48
Table 19.  Numerical Range for the PFSUB Instruction  50
Table 20.  Numerical Range for the PFSUBR Instruction  52
Table 21.  Summary of PREFETCH Instruction Type Options  57
Revision History

Date       Rev  Description
Feb 1998   A    Initial release.
Feb 1998   B    Clarified CPUID usage in "Feature Detection" on page 3.
May 1998   C    Revised description of 3DNow! instructions in "Definitions" on page 9.
May 1998   C    Revised function descriptions in Table 2, "3DNow!™ Floating-Point Instructions," on page 14.
Sept 1998  D    Revised code example for the PFRSQRT instruction on page 48.
Sept 1998  D    Changed exceptions generated for the PREFETCH/PREFETCHW instructions to none, deleted exception table, and revised PREFETCHW description on page 56.
Sept 1998  D    Added PUNPCKLDQ instruction to the division example (24-bit precision) on page 60.
Nov 1998   E    Added sample code that tests for the presence of extended function 8000_0001h on page 3.
Nov 1998   E    Clarified instruction descriptions of PFRCPIT1 on page 41, PFRCPIT2 on page 43, and PFRSQIT1 on page 45.
Nov 1998   E    Added PUNPCKLDQ instruction and clarified comments to the square root examples on page 62.
Aug 1999   F    Changed X variable to Z in Newton-Raphson recurrence definitions, and swapped order of PFMUL and PUNPCKLDQ instructions in square root example (24-bit precision) in Chapter 3 on page 59.
Aug 1999   F    Added references to the AMD Athlon™ processor throughout the manual.
Mar 2000   G    Updated and clarified the PFACC instruction operation description on page 23.
Chapter 1
3DNow!™ Technology

Introduction

3DNow!™ technology is a significant innovation to the x86 architecture that drives today's personal computers. It is a group of new instructions that opens the traditional processing bottlenecks for floating-point-intensive and multimedia applications. With 3DNow! technology, hardware and software applications can implement more powerful solutions to create a more entertaining and productive PC platform. Examples of the improvements that 3DNow! technology enables are faster frame rates on high-resolution scenes, much better physical modeling of real-world environments, sharper and more detailed 3D imaging, smoother video playback, and near theater-quality audio.

AMD has taken a leadership role in developing these new instructions that enable exciting new levels of performance and realism. 3DNow! technology was defined and implemented in collaboration with independent software developers, including operating system designers, application developers, and graphics vendors. It is compatible with today's existing x86 software and requires no operating system support, thereby enabling 3DNow! applications to work with all existing operating systems.

3DNow! technology is implemented on the AMD-K6®-2, AMD-K6-III, and AMD Athlon™ processors. The AMD Athlon processor implements five new 3DNow! technology instructions that add streaming and digital signal processing (DSP) technologies. For more information, see the AMD Extensions to the 3DNow!™ and MMX™ Instruction Sets Manual, order# 22466.

Key Functionality

The 3DNow! technology instructions are intended to open a major processing bottleneck in a 3D graphics application: floating-point operations. Today's 3D applications are facing limitations due to the fact that only one floating-point execution unit exists in the most advanced x86 processors. The front end of a typical 3D graphics software pipeline performs object physics, geometry transformations, clipping, and lighting calculations. These computations are very floating-point intensive and often limit the features and functionality of a 3D application.

The source of performance for the 3DNow! instructions originates from the single instruction multiple data (SIMD) implementation. With SIMD, each instruction not only operates on two single-precision, floating-point operands, but the microarchitecture within the processor can execute up to two 3DNow! instructions per clock through two register execution pipelines, which allows for a total of four floating-point operations per clock. In addition, because the 3DNow! instructions use the same floating-point registers as the MMX™ technology instructions, task switching between MMX and 3DNow! operations is eliminated.

The 3DNow! technology instruction set contains 21 instructions that support SIMD floating-point operations and includes SIMD integer operations, data prefetching, and faster MMX-to-floating-point switching. To improve MPEG decoding, the 3DNow! instructions include a specific SIMD integer instruction created to facilitate pixel-motion compensation.
Because media-based software typically operates on large data sets, the processor often needs to wait for this data to be transferred from main memory. The extra time involved with retrieving this data can be avoided by using the new 3DNow! instruction called PREFETCH. This instruction can ensure that data is in the level 1 cache when it is needed. To improve the time it takes to switch between MMX and x87 code, the 3DNow! instructions include the FEMMS (fast entry/exit multimedia state) instruction, which eliminates much of the overhead involved with the switch.

The addition of 3DNow! technology expands the capabilities of the AMD family of processors and enables a new generation of enriched user applications.

Feature Detection

To properly identify and use the 3DNow! instructions, the application program must determine if the processor supports them. The CPUID instruction gives programmers the ability to determine the presence of 3DNow! technology on a processor. Software applications must first test to see if the CPUID instruction is supported. For a detailed description of the CPUID instruction, see the AMD Processor Recognition Application Note, order# 20734.

The presence of the CPUID instruction is indicated by the ID bit (21) in the EFLAGS register. If this bit is writable, the CPUID instruction is supported. The following code sample shows how to test for the presence of the CPUID instruction.

    pushfd                  ; save EFLAGS
    pop     eax             ; store EFLAGS in EAX
    mov     ebx, eax        ; save in EBX for later testing
    xor     eax, 00200000h  ; toggle bit 21
    push    eax             ; put to stack
    popfd                   ; save changed EAX to EFLAGS
    pushfd                  ; push EFLAGS to TOS
    pop     eax             ; store EFLAGS in EAX
    cmp     eax, ebx        ; see if bit 21 has changed
    jz      no_cpuid        ; if no change, no CPUID

Once the software has identified the processor's support for CPUID, it must test for extended functions by executing extended function 8000_0000h (EAX=8000_0000h). The EAX register returns the largest extended function input value defined for the CPUID instruction on the processor. If the value is greater than 8000_0000h, extended functions are supported. The following code sample shows how to test for the presence of extended function 8000_0001h.

    mov     eax, 80000000h  ; query for extended functions
    cpuid                   ; get extended function limit
    cmp     eax, 80000000h  ; is 8000_0001h supported?
    jbe     no_extendedmsr  ; if not, 3DNow! tech. not supported
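The two checks described so far reduce to a bit test and an unsigned comparison. The following Python model is only a sketch of that decision logic, not code that runs the CPUID instruction; the function names `cpuid_supported` and `extended_functions_supported` are hypothetical, and the EFLAGS and EAX values are assumed to have been captured by assembly code such as the samples above.

```python
ID_BIT = 1 << 21  # EFLAGS ID bit (bit 21), as named in the text


def cpuid_supported(eflags_before, eflags_after):
    """Model of the first check: CPUID exists if the toggled ID bit stuck.

    eflags_before is EFLAGS as first read; eflags_after is EFLAGS read back
    after writing the value with bit 21 toggled. (The assembly sample compares
    the whole register; this sketch looks only at bit 21.)
    """
    return ((eflags_before ^ eflags_after) & ID_BIT) != 0


def extended_functions_supported(max_extended_function):
    """Model of the second check: EAX returned by function 8000_0000h must be
    greater than 8000_0000h (an unsigned comparison, like JBE in the sample)."""
    return max_extended_function > 0x8000_0000
```

For instance, `extended_functions_supported(0x8000_0001)` is true, while a processor that returns 8000_0000h or any smaller value fails the check and takes the `no_extendedmsr` path.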
The next step is for the programmer to determine if the 3DNow! instructions are supported. Extended function 8000_0001h of the CPUID instruction provides this information by returning the extended feature bits in the EDX register. If bit 31 in the EDX register is set to 1, 3DNow! instructions are supported. The following code sample shows how to test for 3DNow! instruction support.

    mov     eax, 80000001h  ; setup ext. function 8000_0001h
    cpuid                   ; call the function
    test    edx, 80000000h  ; test bit 31
    jnz     yes_3dnow       ; 3DNow! technology supported

The processor supports all of the above features. Concatenating the code examples above will produce the basis for a CPU detection software routine. A more comprehensive code example is available on the AMD website at http://www.amd.com/products/cpg/bin/.

Register Set

The complete multimedia units in the processor combine the existing MMX instructions with the new 3DNow! instructions. In addition, by merging 3DNow! with MMX, it becomes possible to write x86 programs containing integer, MMX, and floating-point graphics instructions with no performance penalty for switching between the multimedia (integer) and 3DNow! (floating-point) units.

The processor implements eight 64-bit 3DNow!/MMX registers. These registers are mapped onto the floating-point registers. As shown in Figure 1, the 3DNow! and MMX instructions refer to these registers as MM0 to MM7. Mapping the new 3DNow!/MMX registers onto the floating-point register stack enables backwards compatibility for the register saving that must occur as a result of task switching.
Figure 1. 3DNow!™/MMX™ Registers (eight 64-bit registers, MM0 to MM7, bits 63 to 0, each with a two-bit tag field)

Aliasing the 3DNow!/MMX registers onto the floating-point register stack provides a safe method to introduce 3DNow! and MMX technology, because it does not require modifications to existing operating systems. Instead of requiring operating system modifications, new 3DNow! and MMX technology applications are supported through device drivers, 3DNow! and MMX libraries, or dynamic link library (DLL) files.

Current operating systems have support for floating-point operations and the floating-point register state. Using the floating-point registers for 3DNow! and MMX code is a convenient way of implementing non-intrusive support for 3DNow! and MMX instructions.

Every time the processor executes a 3DNow! or MMX instruction, all the floating-point register tag bits are set to zero (00b = valid), except for the FEMMS and EMMS instructions, which set all tag bits to one (11b = empty).

Note: Executing the PREFETCH instruction does not change the tag bits.
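The tag-bit rule above can be written as a small state-update function. This is an illustrative Python model only; the name `tags_after` is hypothetical, and treating PREFETCHW the same as PREFETCH is an assumption, since the note names only PREFETCH.

```python
VALID = 0b00  # tag value written by ordinary 3DNow!/MMX instructions
EMPTY = 0b11  # tag value written by FEMMS and EMMS


def tags_after(mnemonic, tags):
    """Return the eight register tags after executing one instruction.

    tags is a list of eight 2-bit tag values, one per register MM0..MM7.
    """
    name = mnemonic.upper()
    if name in ("PREFETCH", "PREFETCHW"):  # PREFETCHW is an assumption
        return list(tags)                  # tag bits are left unchanged
    if name in ("FEMMS", "EMMS"):
        return [EMPTY] * 8                 # all registers marked empty
    return [VALID] * 8                     # any other 3DNow!/MMX instruction
```

Starting from an all-empty tag word, `tags_after("PFADD", [EMPTY] * 8)` yields eight valid tags, and a following `tags_after("FEMMS", ...)` returns them all to empty.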
Data Types

3DNow! technology uses a packed data format. The data is packed in a single, 64-bit 3DNow!/MMX register or a quadword memory operand. Figure 2 shows the 3DNow! floating-point data type. D0 and D1 each hold an IEEE 32-bit single-precision, floating-point doubleword.

Figure 2. 3DNow!™ Data Type (two packed, single-precision, floating-point doublewords, 32 bits x 2: D1 in bits 63:32, D0 in bits 31:0)

Figure 3 on page 6 shows the format of the IEEE 32-bit, single-precision, floating-point format.

Figure 3. Single-Precision, Floating-Point Data Format (32-bit doubleword: sign S in bit 31, biased exponent in bits 30:23, significand in bits 22:0)

Value definitions:

1. $x = (-1)^{s} \cdot 0$ when biased exponent $= 0$
2. $x = (-1)^{s} \cdot 2^{(\text{biased exponent} - 127)} \cdot \text{significand}$ when $0 <$ biased exponent $< 255$

Figure 4 shows the formats for the integer data types.

Figure 4. Integer Data Types (packed bytes, 8 bits x 8: B7 to B0; packed words, 16 bits x 4: W3 to W0; packed doublewords, 32 bits x 2: D1 and D0; quadword, 64 bits x 1: Q0)
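As a rough illustration of the packed floating-point layout, the Python sketch below builds the 64-bit image of a register from two single-precision values and splits a doubleword into its three fields. The helper names `pack_pair` and `split_fields` are hypothetical, and placing D0 in the low doubleword follows the usual x86 convention, which is an assumption about Figure 2.

```python
import struct


def pack_pair(d0, d1):
    """Pack two floats into one 64-bit value: D0 in bits 31:0, D1 in bits 63:32."""
    low, high = struct.unpack("<II", struct.pack("<ff", d0, d1))
    return (high << 32) | low


def split_fields(dword):
    """Split a 32-bit single-precision pattern into (sign, biased exponent, significand)."""
    sign = (dword >> 31) & 0x1
    biased_exponent = (dword >> 23) & 0xFF
    significand = dword & 0x7FFFFF      # 23 stored bits; the hidden bit is not stored
    return sign, biased_exponent, significand
```

For example, `pack_pair(1.0, 2.0)` gives `0x400000003F800000`, and `split_fields(0x3F800000)` gives `(0, 127, 0)`, the encoding of 1.0.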
3DNow!™ Instruction Formats

The format of 3DNow! instruction encodings is based on the conventional x86 modR/M instruction format and is similar to the format used by MMX instructions. The assembly language syntax used for the 3DNow! instructions is as follows:

    3DNow! mnemonic mmreg1, mmreg2/mem64

The destination and source1 operand (mmreg1) must be an MMX register (MM0 to MM7). The source2 operand (mmreg2/mem64) can be either an MMX register or a 64-bit memory value.

The encoding uses the opcode prefix 0Fh followed by a second opcode byte of 0Fh. To differentiate the various 3DNow! instructions, a third instruction suffix byte is used. This suffix byte occupies the same position at the end of a 3DNow! instruction as would an imm8 byte. The opcode format is as follows:

    0Fh 0Fh modR/M [sib] [displacement] 3DNow!_suffix

The specific operands (mmreg1 and mmreg2/mem64) determine the values used in modR/M [sib] [displacement], and follow conventional x86 encodings. The 3DNow! suffix is determined by the actual 3DNow! instruction. The 3DNow! suffixes are defined in Table 2 on page 14.

As an example, the 3DNow! PFMUL instruction can produce the following opcodes, depending on its use:

    Opcode               Instruction
    0F 0F CA B4          PFMUL mm1, mm2
    0F 0F 0B B4          PFMUL mm1, [ebx]
    0F 0F 4B 0A B4       PFMUL mm1, [ebx+10]
    26 0F 0F 0B B4       PFMUL mm1, es:[ebx]
    0F 0F 4C 83 0A B4    PFMUL mm1, [ebx+eax*4+10]

The encoding of the two performance-enhancement instructions (FEMMS and PREFETCH) uses a single opcode prefix 0Fh. The details of the opcodes for these instructions are shown on pages 18 and 56 respectively.
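To make the byte layout concrete, the following Python sketch assembles the first three PFMUL encodings listed above from their operands. It is a minimal illustration that assumes standard x86 modR/M rules; the function names are hypothetical, and SIB addressing, 32-bit displacements, and prefixes are deliberately left out.

```python
PFMUL_SUFFIX = 0xB4  # suffix byte for PFMUL, taken from the opcode examples
EBX = 3              # standard x86 register number for EBX


def encode_reg_reg(dest, src, suffix):
    """Encode 'op mmDEST, mmSRC': 0Fh 0Fh, modR/M with mod=11b, then the suffix."""
    modrm = 0xC0 | (dest << 3) | src
    return bytes([0x0F, 0x0F, modrm, suffix])


def encode_reg_mem(dest, base, suffix, disp8=None):
    """Encode 'op mmDEST, [base]' or 'op mmDEST, [base+disp8]' without a SIB byte."""
    if base == 4:
        raise ValueError("ESP as base needs a SIB byte, not handled in this sketch")
    if disp8 is None:
        if base == 5:
            raise ValueError("mod=00b with EBP means disp32, not handled in this sketch")
        return bytes([0x0F, 0x0F, (dest << 3) | base, suffix])
    if not -128 <= disp8 <= 127:
        raise ValueError("displacement does not fit in 8 bits")
    modrm = 0x40 | (dest << 3) | base          # mod=01b selects an 8-bit displacement
    return bytes([0x0F, 0x0F, modrm, disp8 & 0xFF, suffix])
```

With these helpers, `encode_reg_mem(1, EBX, PFMUL_SUFFIX, 10).hex()` produces `0f0f4b0ab4`, matching the `PFMUL mm1, [ebx+10]` row; the suffix always lands last, where an imm8 would normally sit.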
Definitions

3DNow! technology provides 21 additional instructions to support high-performance, 3D graphics and audio processing. 3DNow! instructions are vector instructions that operate on 64-bit registers. 3DNow! instructions are SIMD: each instruction operates on pairs of 32-bit values.

The definitions for the 3DNow! instructions starting on page 17 contain designations classifying each instruction as vectored or scalar. Vector instructions operate in parallel on two sets of 32-bit, single-precision, floating-point words. Instructions that are labeled as scalar instructions operate on a single set of 32-bit operands (from the low halves of the two 64-bit operands).

The 3DNow! single-precision, floating-point format is compatible with the IEEE-754, single-precision format. This format comprises a 1-bit sign, an 8-bit biased exponent, and a 23-bit significand with one hidden integer bit for a total of 24 bits in the significand. The bias of the exponent is 127, consistent with the IEEE single-precision standard. The significands are normalized to be within the range of [1,2).

In contrast to the IEEE standard that dictates four rounding modes, 3DNow! technology supports one rounding mode, either round-to-nearest or round-to-zero (truncation). The hardware implementation of 3DNow! technology determines the rounding mode. The AMD processors implement round-to-nearest mode. Regardless of the rounding mode used, the floating-point-to-integer and integer-to-floating-point conversion instructions, PF2ID and PI2FD, always use the round-to-zero (truncation) mode.

The largest, representable, normal number in magnitude for this precision in hexadecimal has an exponent of FEh and a significand of 7FFFFFh, with a numerical value of $2^{127}(2 - 2^{-23})$.
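The stated maximum can be checked by evaluating the format directly. The Python sketch below computes the value of a normal number from its three fields, making the hidden integer bit explicit; the function name `normal_value` is hypothetical, and the sketch covers only biased exponents 01h through FEh.

```python
def normal_value(sign, biased_exponent, significand):
    """Value of a normal single-precision number given its three fields.

    The 23 stored significand bits are the fraction; the hidden integer bit
    adds 1, so the full significand lies in [1, 2).
    """
    if not 0x01 <= biased_exponent <= 0xFE:
        raise ValueError("this sketch covers normal numbers only")
    full_significand = 1 + significand / 2**23
    return (-1) ** sign * 2.0 ** (biased_exponent - 127) * full_significand


largest = normal_value(0, 0xFE, 0x7FFFFF)  # exponent FEh, significand 7FFFFFh
```

Here `largest` equals `2.0**127 * (2 - 2.0**-23)`, the value quoted in the text, and `normal_value(0, 127, 0)` is 1.0.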
All results that overflow above the maximum-representable positive value are saturated to either this maximum-representable normal number or to positive infinity. Similarly, all results that overflow below the minimum-representable negative value are saturated to either this minimum-representable normal number or to negative infinity. The implementation of 3DNow! technology determines how arithmetic overflow is handled: either properly signed maximum- or minimum-representable normal numbers or properly signed infinities. The processor generates properly signed maximum- or minimum-representable normal numbers. Infinities and NaNs are not supported as operands to 3DNow! instructions.

The smallest representable normal number in magnitude for this precision in hexadecimal has an exponent of 01h and a significand of 000000h, with a numerical value of $2^{-126}$. Accordingly, all results below this minimum representable value in magnitude are held to zero. Table 1 shows the exponent ranges supported by the 3DNow! technology.

Like MMX instructions, 3DNow! instructions do not generate numeric exceptions nor do they set any status flags. It is the user's responsibility to ensure that in-range data is provided to 3DNow! instructions and that all computations remain within valid ranges (or are held as expected).

Table 1. 3DNow!™ Technology Exponent Ranges

    Biased Exponent    Description
    FFh                Unsupported *
    00h                Zero
    00h < x < FFh      Normal

Execution Resources on AMD-K6® Processors

The register operations of all 3DNow! floating-point instructions are executed by either the register X unit or the register Y unit. One operation can be issued to each register unit each clock cycle, for a maximum issue and execution rate of two 3DNow! operations per cycle. All 3DNow! operations have an execution latency of two clock cycles and are fully pipelined.

Even though 3DNow! execution resources are not duplicated in both register units (for example, there are not two pairs of 3DNow! multipliers, just one shared pair of multipliers), there are no instruction-decode or operation-issue pairing restrictions. When, for example, a 3DNow! multiply operation starts execution in a register unit, that unit grabs and uses the one shared pair of 3DNow! multipliers. Only when actual contention occurs between two 3DNow! operations starting execution at the same time is one of the operations held up for one cycle in its first execution pipe stage while the other proceeds. The delay is never more than one cycle.

For code optimization purposes, 3DNow! operations are grouped into two categories. These categories are based on execution resources and are important when creating properly scheduled code. As long as two 3DNow! operations that start execution simultaneously do not fall into the same category, both operations will start execution without delay.

The first category of instructions contains the operations for the following 3DNow! instructions: PFADD, PFSUB, PFSUBR, PFACC, PFCMPx, PFMIN, PFMAX, PI2FD, PF2ID, PFRCP, and PFRSQRT.

The second category contains the operations for the following 3DNow! instructions: PFMUL, PFRCPIT1, PFRSQIT1, and PFRCPIT2.

Note: 3DNow! add and multiply operations, among other combinations, can execute simultaneously.

Normally, in high-performance 3DNow! code, all of the 3DNow! instructions are properly scheduled apart from each other so as to avoid delays due to execution resource contentions (as well as taking into account dependencies and execution latencies).
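The two-category rule lends itself to a simple lookup when checking a schedule by hand. The Python sketch below is one hedged way to model it: the names `category` and `start_delay` are hypothetical, PFCMPx is expanded into the three compare mnemonics, and the model ignores dependencies and latencies, which real scheduling must also consider.

```python
ADD_CATEGORY = {
    "PFADD", "PFSUB", "PFSUBR", "PFACC", "PFCMPEQ", "PFCMPGE", "PFCMPGT",
    "PFMIN", "PFMAX", "PI2FD", "PF2ID", "PFRCP", "PFRSQRT",
}
MUL_CATEGORY = {"PFMUL", "PFRCPIT1", "PFRSQIT1", "PFRCPIT2"}


def category(mnemonic):
    """Return 1 or 2 for the execution-resource category of a 3DNow! instruction."""
    name = mnemonic.upper()
    if name in ADD_CATEGORY:
        return 1
    if name in MUL_CATEGORY:
        return 2
    raise ValueError("not in either category list: " + mnemonic)


def start_delay(first, second):
    """Cycles one operation is held when both try to start in the same clock.

    Same category means contention for the shared resource: one cycle.
    Different categories start together with no delay.
    """
    return 1 if category(first) == category(second) else 0
```

Under this model, `start_delay("PFADD", "PFMUL")` is 0 while `start_delay("PFMUL", "PFRCPIT2")` is 1, which is why interleaving add-type and multiply-type operations tends to schedule well.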
For further information regarding code optimization, see the AMD-K6® Processor Code Optimization Application Note, order# 21924. This document provides in-depth discussions of code optimization techniques for the processor. For execution resources information on the AMD Athlon processor, refer to the AMD Athlon Processor x86 Code Optimization Guide, order# 22007.

The SIMD 3DNow! instructions for all processors are summarized in Table 2 on page 14. The dedicated and shared execution resources of the register X unit and register Y unit are shown in Figure 5 on page 13. The execution resources for some MMX operations, as well as all 3DNow! operations, are shared between the two register units. For contention-checking purposes, each box represents a category of operations that cannot start execution simultaneously. In addition, the MMX and 3DNow! multiplies use the same hardware, while MMX and 3DNow! adds and subtracts do not.

The 3DNow! performance-enhancement instructions for all AMD processors are summarized in Table 3 on page 14. The FEMMS instruction does not use any specific execution resource or pipeline. The PREFETCH instruction is operated on in the load unit.
Figure 5. Register X Unit and Register Y Unit Resources (block diagram of the register X and register Y execution pipelines, showing dedicated register X resources, dedicated register Y resources, and shared register X and Y resources; the shared boxes include "3DNow!™ add/subtract, compare, integer conversion, reciprocal and reciprocal square root table lookup" and "MMX™ and 3DNow! multiply, reciprocal and reciprocal square root iteration")
Table 2. 3DNow!™ Floating-Point Instructions

    Operation   Function                                                                          Opcode Suffix
    PAVGUSB     Packed 8-bit unsigned integer averaging                                           BFh
    PFADD       Packed floating-point addition                                                    9Eh
    PFSUB       Packed floating-point subtraction                                                 9Ah
    PFSUBR      Packed floating-point reverse subtraction                                         AAh
    PFACC       Packed floating-point accumulate                                                  AEh
    PFCMPGE     Packed floating-point comparison, greater or equal                                90h
    PFCMPGT     Packed floating-point comparison, greater                                         A0h
    PFCMPEQ     Packed floating-point comparison, equal                                           B0h
    PFMIN       Packed floating-point minimum                                                     94h
    PFMAX       Packed floating-point maximum                                                     A4h
    PI2FD       Packed 32-bit integer to floating-point conversion                                0Dh
    PF2ID       Packed floating-point to 32-bit integer                                           1Dh
    PFRCP       Packed floating-point reciprocal approximation                                    96h
    PFRSQRT     Packed floating-point reciprocal square root approximation                        97h
    PFMUL       Packed floating-point multiplication                                              B4h
    PFRCPIT1    Packed floating-point reciprocal first iteration step                             A6h
    PFRSQIT1    Packed floating-point reciprocal square root first iteration step                 A7h
    PFRCPIT2    Packed floating-point reciprocal/reciprocal square root second iteration step     B6h
    PMULHRW     Packed 16-bit integer multiply with rounding                                      B7h

Table 3. 3DNow!™ Performance-Enhancement Instructions

    Operation              Function                                                        Opcode Second Byte
    FEMMS                  Faster entry/exit of the MMX™ or floating-point state           0Eh
    PREFETCH/PREFETCHW *   Prefetch at least a 32-byte line into L1 data cache (Dcache)    0Dh

Note: * The AMD-K6-2 and AMD-K6-III processors execute the PREFETCHW instruction identically to the PREFETCH instruction. On the AMD Athlon processor, PREFETCHW can increase performance by providing a hint to the processor of an intent to modify the cache line.
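Because the suffix byte alone identifies the operation, Table 2 can be used as a lookup table when reading raw opcodes. The Python sketch below illustrates the idea under simplifying assumptions: the name `decode_3dnow` is hypothetical, the input must begin with the 0Fh 0Fh opcode bytes (no prefixes), and the byte sequence is assumed to contain exactly one instruction.

```python
SUFFIX_TO_MNEMONIC = {
    0xBF: "PAVGUSB", 0x9E: "PFADD", 0x9A: "PFSUB", 0xAA: "PFSUBR",
    0xAE: "PFACC", 0x90: "PFCMPGE", 0xA0: "PFCMPGT", 0xB0: "PFCMPEQ",
    0x94: "PFMIN", 0xA4: "PFMAX", 0x0D: "PI2FD", 0x1D: "PF2ID",
    0x96: "PFRCP", 0x97: "PFRSQRT", 0xB4: "PFMUL", 0xA6: "PFRCPIT1",
    0xA7: "PFRSQIT1", 0xB6: "PFRCPIT2", 0xB7: "PMULHRW",
}


def decode_3dnow(code):
    """Return the mnemonic for one 0Fh 0Fh encoded instruction, or None.

    The suffix is the last byte, after modR/M and any SIB or displacement.
    """
    if len(code) < 4 or code[0] != 0x0F or code[1] != 0x0F:
        return None
    return SUFFIX_TO_MNEMONIC.get(code[-1])
```

For example, `decode_3dnow(bytes.fromhex("0f0fcab4"))` returns `"PFMUL"`. FEMMS and PREFETCH use the single-prefix encoding from Table 3 and are intentionally not recognized here.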
Task Switching

With respect to task switching, treat the 3DNow! instructions exactly the same as MMX instructions. Operating system design must be taken into account when writing a 3DNow! program. The programmer must know whether the operating system automatically saves the current state when task switching, or whether the 3DNow! program has to provide the code to save the state.

If a task switch occurs, the control register (CR0) task switch (TS) bit is set to 1. The processor then generates an interrupt 7 (int 7, device not available) when it encounters the next floating-point, 3DNow!, or MMX instruction, allowing the operating system to save the state of the 3DNow!/MMX/FP registers.

In a multitasking operating system, if there is a task switch when 3DNow!/MMX applications are running with older applications that do not include MMX instructions, the MMX/FP register state is still saved automatically through the int 7 handler.

Exceptions

Table 4 contains a list of exceptions that 3DNow! and MMX instructions can generate.

Table 4. 3DNow! and MMX Instruction Exceptions

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   X     X             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
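The first two rows of Table 4 and the pending floating-point row depend only on processor state, not on the memory operand. The following C sketch models that part of the table. The function name `state_exception` is hypothetical, and the order in which the conditions are checked (EM, then TS, then a pending floating-point exception) is an assumption based on the usual x86 behavior for MMX instructions; the table itself does not state a priority. The CR0 bit positions (EM = bit 2, TS = bit 3) are architectural.

```c
#include <stdint.h>

#define CR0_EM (1u << 2)   /* emulate instruction bit */
#define CR0_TS (1u << 3)   /* task switch bit */

/* Returns the interrupt number raised before a 3DNow!/MMX instruction
   executes, or 0 if none of the state-related conditions applies.
   The check order is an assumption, not taken from Table 4. */
static int state_exception(uint32_t cr0, int fp_exception_pending)
{
    if (cr0 & CR0_EM)
        return 6;    /* invalid opcode */
    if (cr0 & CR0_TS)
        return 7;    /* device not available: OS saves FP/MMX state */
    if (fp_exception_pending)
        return 16;   /* delivered as interrupt 16 when CR0.NE = 1 */
    return 0;
}
```

For example, `state_exception(CR0_TS, 0)` yields 7, which is the case an operating system uses to save the FP/MMX state lazily after a task switch.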
The rules for exceptions are the same for both MMX and 3DNow! instructions. In addition, exception detection and handling are identical for MMX and 3DNow! instructions. None of the exception handlers need modification.

Notes:
1. An invalid opcode exception (interrupt 6) occurs if a 3DNow! instruction is executed on a processor that does not support 3DNow! instructions.
2. If a floating-point exception is pending and the processor encounters a 3DNow! instruction, FERR# is asserted and, if CR0.NE = 1, an interrupt 16 is generated. (This is the same for MMX instructions.)

Prefixes

The following prefixes can be used with 3DNow! instructions:
- The segment override prefixes (2Eh/CS, 36h/SS, 3Eh/DS, 26h/ES, 64h/FS, and 65h/GS) affect 3DNow! instructions that contain a memory operand.
- The address-size override prefix (67h) affects 3DNow! instructions that contain a memory operand.
- The operand-size override prefix (66h) is ignored.
- The LOCK prefix (F0h) triggers an invalid opcode exception (interrupt 6).
- The REP prefixes (F3h/REP/REPE/REPZ, F2h/REPNE/REPNZ) are ignored.
2 3DNow! Instruction Set

The following 3DNow! instruction definitions are in alphabetical order according to the instruction mnemonics.
FEMMS

Mnemonic   Opcode     Description
FEMMS      0Fh 0Eh    Faster enter/exit of the MMX or floating-point state

Privilege: None
Registers affected: MMX
Flags affected: None

Like the EMMS instruction, the FEMMS instruction can be used to clear the MMX state following the execution of a block of MMX instructions. Because the MMX registers and tag words are shared with the floating-point unit, it is necessary to clear the state before executing floating-point instructions. Unlike the EMMS instruction, the contents of the MMX/floating-point registers are undefined after a FEMMS instruction is executed. Therefore, the FEMMS instruction offers a faster context switch at the end of an MMX routine where the values in the MMX registers are no longer required. FEMMS can also be used prior to executing MMX instructions where the preceding floating-point register values are no longer required, which facilitates faster context switching.

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
PAVGUSB

Mnemonic                         Opcode/imm8      Description
PAVGUSB mmreg1, mmreg2/mem64     0Fh 0Fh / BFh    Average of unsigned packed 8-bit values

Privilege: None
Registers affected: MMX
Flags affected: None

The PAVGUSB instruction produces the rounded averages of the eight unsigned 8-bit integer values in the source operand (an MMX register or a 64-bit memory location) and the eight corresponding unsigned 8-bit integer values in the destination operand (an MMX register). It does so by adding the source and destination byte values and then adding 001h to the 9-bit intermediate value. The intermediate value is then divided by 2 (shifted right one place), and the eight unsigned 8-bit results are stored in the MMX register specified as the destination operand.

The PAVGUSB instruction can be used for pixel averaging in MPEG-2 motion compensation and video scaling operations.

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Functional illustration of the PAVGUSB instruction

Bit position        63                                   0
mmreg2/mem64        FFh  FFh  01h  0Fh  9Ah  00h  70h  07h
mmreg1              FFh  00h  FFh  10h  A8h  01h  44h  F7h
mmreg1 (result)     FFh  80h  80h  10h  A1h  01h  5Ah  7Fh

(In the original figure, a marker indicates each result value that was rounded up.)

The following list explains the functional illustration of the PAVGUSB instruction:
- The rounded byte average of FFh and FFh is FFh.
- The rounded byte average of FFh and 00h is 80h.
- The rounded byte average of 01h and FFh is also 80h.
- The rounded byte average of 0Fh and 10h is 10h.
- The rounded byte average of 00h and 01h is 01h.
- The rounded byte average of 70h and 44h is 5Ah.
- The rounded byte average of 07h and F7h is 7Fh.
- The rounded byte average of 9Ah and A8h is A1h.

The equations for byte averaging with rounding are as follows:
- mmreg1[63:56] = (mmreg1[63:56] + mmreg2/mem64[63:56] + 01h)/2
- mmreg1[55:48] = (mmreg1[55:48] + mmreg2/mem64[55:48] + 01h)/2
- mmreg1[47:40] = (mmreg1[47:40] + mmreg2/mem64[47:40] + 01h)/2
- mmreg1[39:32] = (mmreg1[39:32] + mmreg2/mem64[39:32] + 01h)/2
- mmreg1[31:24] = (mmreg1[31:24] + mmreg2/mem64[31:24] + 01h)/2
- mmreg1[23:16] = (mmreg1[23:16] + mmreg2/mem64[23:16] + 01h)/2
- mmreg1[15:8] = (mmreg1[15:8] + mmreg2/mem64[15:8] + 01h)/2
- mmreg1[7:0] = (mmreg1[7:0] + mmreg2/mem64[7:0] + 01h)/2
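The byte-averaging equations can be checked with a small software model. The following C sketch is illustrative only: the names `avg_u8` and `pavgusb_model` are hypothetical, and each 64-bit operand is represented as a `uint64_t` with byte lane 0 in bits [7:0].

```c
#include <stdint.h>

/* Rounded average of two unsigned bytes: (a + b + 1) >> 1, computed in a
   wider type so the 9-bit intermediate sum cannot overflow. */
static uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Apply the rounded average to all eight byte lanes of two 64-bit values.
   Lane 0 is bits [7:0]; lane 7 is bits [63:56]. */
static uint64_t pavgusb_model(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; ++lane) {
        uint8_t a = (uint8_t)(dst >> (8 * lane));
        uint8_t b = (uint8_t)(src >> (8 * lane));
        result |= (uint64_t)avg_u8(a, b) << (8 * lane);
    }
    return result;
}
```

With the operand values from the functional illustration, `pavgusb_model(0xFF00FF10A80144F7, 0xFFFF010F9A007007)` gives `0xFF808010A1015A7F`, which matches the result row byte for byte.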
PF2ID

Mnemonic                       Opcode/imm8      Description
PF2ID mmreg1, mmreg2/mem64     0Fh 0Fh / 1Dh    Converts packed floating-point operand to packed 32-bit integer

Privilege: None
Registers affected: MMX
Flags affected: None

PF2ID is a vector instruction that converts a vector register containing single-precision, floating-point operands to 32-bit signed integers using truncation. Table 5 on page 22 shows the numerical range of the PF2ID instruction.

The PF2ID instruction performs the following operations:

   IF (mmreg2/mem64[31:0] >= 2^31)
      THEN mmreg1[31:0] = 7FFF_FFFFh
   ELSEIF (mmreg2/mem64[31:0] <= -2^31)
      THEN mmreg1[31:0] = 8000_0000h
   ELSE mmreg1[31:0] = int(mmreg2/mem64[31:0])

   IF (mmreg2/mem64[63:32] >= 2^31)
      THEN mmreg1[63:32] = 7FFF_FFFFh
   ELSEIF (mmreg2/mem64[63:32] <= -2^31)
      THEN mmreg1[63:32] = 8000_0000h
   ELSE mmreg1[63:32] = int(mmreg2/mem64[63:32])

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Related instructions

See the PI2FD instruction.

Table 5. Numerical Range for the PF2ID Instruction

Source 2                                     Source 1 and destination
0                                            0
Normal, abs(source 1) < 1                    0
Normal, -2147483648 < source 1 <= -1         Round to zero (source 1)
Normal, 1 <= source 1 < 2147483648           Round to zero (source 1)
Normal, source 1 >= 2147483648               7FFF_FFFFh
Normal, source 1 <= -2147483648              8000_0000h
Unsupported                                  Undefined
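The saturating conversion in Table 5 can be expressed per 32-bit lane in C. This is a minimal sketch, not a description of the hardware: the name `pf2id_lane` is hypothetical, and the value returned for NaN is an arbitrary choice, because the manual leaves unsupported inputs undefined.

```c
#include <stdint.h>

/* One 32-bit lane of a PF2ID-style conversion: truncate toward zero and
   saturate at the limits of a signed 32-bit integer. */
static int32_t pf2id_lane(float x)
{
    if (x != x)
        return 0;                  /* NaN: arbitrary, the manual says "undefined" */
    if (x >= 2147483648.0f)        /* x >= 2^31 */
        return INT32_MAX;          /* 7FFF_FFFFh */
    if (x <= -2147483648.0f)       /* x <= -2^31 */
        return INT32_MIN;          /* 8000_0000h */
    return (int32_t)x;             /* C conversion truncates toward zero */
}
```

For example, 1.9 converts to 1 and -1.9 converts to -1, while 3.0e9 saturates to 7FFF_FFFFh instead of wrapping.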
PFACC

Mnemonic                       Opcode/imm8      Description
PFACC mmreg1, mmreg2/mem64     0Fh 0Fh / AEh    Floating-point accumulate

Privilege: None
Registers affected: MMX
Flags affected: None

PFACC is a vector instruction that accumulates the two words of the destination operand and the two words of the source operand, and stores the results in the low and high words of the destination operand, respectively. Both operands are single-precision, floating-point operands with 24-bit significands. Table 6 on page 24 shows the numerical range of the PFACC instruction.

The PFACC instruction performs the following operations:

   temp = mmreg2/mem64
   mmreg1[31:0] = mmreg1[31:0] + mmreg1[63:32]
   mmreg1[63:32] = temp[31:0] + temp[63:32]

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Table 6. Numerical Range for the PFACC Instruction

                                Source 2
Source 1 and destination        0                 Normal                   Unsupported
0                               +/- 0 (note 1)    Source 2                 Source 2
Normal                          Source 1          Normal, +/- 0 (note 2)   Undefined
Unsupported                     Source 1          Undefined                Undefined

Notes:
1. The sign of the result is the logical AND of the signs of the source operands.
2. If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand that is larger in magnitude (if the magnitudes are equal, the sign of source 1 is used). If the absolute value of the result is greater than or equal to 2^128, the result is the largest normal number with the sign being the sign of the source operand that is larger in magnitude.
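The PFACC operations amount to a horizontal add within each operand, which is useful for reducing partial sums. The C sketch below models this with a hypothetical `vec2f` struct (`lo` for bits [31:0], `hi` for bits [63:32]) and shows a four-element dot product as one possible use. It uses ordinary C float arithmetic and does not model the underflow and overflow rules in Table 6.

```c
/* Two packed single-precision values: lo = bits [31:0], hi = bits [63:32]. */
typedef struct { float lo, hi; } vec2f;

/* Model of the PFACC data flow. Passing by value plays the role of the
   "temp" copy, so dst and src may be the same vector. */
static vec2f pfacc_model(vec2f dst, vec2f src)
{
    vec2f r;
    r.lo = dst.lo + dst.hi;
    r.hi = src.lo + src.hi;
    return r;
}

/* Four-element dot product: lane-wise products, then two accumulates. */
static float dot4(const float a[4], const float b[4])
{
    vec2f p01 = { a[0] * b[0], a[1] * b[1] };
    vec2f p23 = { a[2] * b[2], a[3] * b[3] };
    vec2f s = pfacc_model(p01, p23);   /* s.lo = p0 + p1, s.hi = p2 + p3 */
    s = pfacc_model(s, s);             /* s.lo = (p0 + p1) + (p2 + p3)   */
    return s.lo;
}
```

On the processor, the lane-wise products would come from a packed multiply; here they are written as plain C multiplications to keep the sketch short.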
PFADD

Mnemonic                       Opcode/imm8      Description
PFADD mmreg1, mmreg2/mem64     0Fh 0Fh / 9Eh    Packed, floating-point addition

Privilege: None
Registers affected: MMX
Flags affected: None

PFADD is a vector instruction that performs addition of the destination operand and the source operand. Both operands are single-precision, floating-point operands with 24-bit significands. Table 7 on page 26 shows the numerical range of the PFADD instruction.

The PFADD instruction performs the following operations:

   mmreg1[31:0] = mmreg1[31:0] + mmreg2/mem64[31:0]
   mmreg1[63:32] = mmreg1[63:32] + mmreg2/mem64[63:32]

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Table 7. Numerical Range for the PFADD Instruction

                                Source 2
Source 1 and destination        0                 Normal                   Unsupported
0                               +/- 0 (note 1)    Source 2                 Source 2
Normal                          Source 1          Normal, +/- 0 (note 2)   Undefined
Unsupported                     Source 1          Undefined                Undefined

Notes:
1. The sign of the result is the logical AND of the signs of the source operands.
2. If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand that is larger in magnitude (if the magnitudes are equal, the sign of source 1 is used). If the absolute value of the result is greater than or equal to 2^128, the result is the largest normal number with the sign being the sign of the source operand that is larger in magnitude.
PFCMPEQ

Mnemonic                         Opcode/imm8      Description
PFCMPEQ mmreg1, mmreg2/mem64     0Fh 0Fh / B0h    Packed floating-point comparison, equal to

Privilege: None
Registers affected: MMX
Flags affected: None

PFCMPEQ is a vector instruction that performs a comparison of the destination operand and the source operand and generates all one bits or all zero bits based on the result of the corresponding comparison. Table 8 on page 28 shows the numerical range of the PFCMPEQ instruction.

The PFCMPEQ instruction performs the following operations:

   IF (mmreg1[31:0] = mmreg2/mem64[31:0])
      THEN mmreg1[31:0] = FFFF_FFFFh
   ELSE mmreg1[31:0] = 0000_0000h

   IF (mmreg1[63:32] = mmreg2/mem64[63:32])
      THEN mmreg1[63:32] = FFFF_FFFFh
   ELSE mmreg1[63:32] = 0000_0000h

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Related instructions

See the PFCMPGE instruction.
See the PFCMPGT instruction.

Table 8. Numerical Range for the PFCMPEQ Instruction

                                Source 2
Source 1 and destination        0                      Normal                            Unsupported
0                               FFFF_FFFFh (note 1)    0000_0000h                        0000_0000h
Normal                          0000_0000h             0000_0000h, FFFF_FFFFh (note 2)   0000_0000h
Unsupported                     0000_0000h             0000_0000h                        Undefined

Notes:
1. Positive zero is equal to negative zero.
2. The result is FFFF_FFFFh if source 1 and source 2 have identical signs, exponents, and mantissas. Otherwise, the result is 0000_0000h.
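The per-lane result of PFCMPEQ is a mask rather than a flag. As a rough illustration, the following C sketch builds the 64-bit mask for two lanes. The names `vec2f` and `pfcmpeq_model` are hypothetical, and only zero and normal inputs are considered, so the "unsupported" row and column of Table 8 are not modeled.

```c
#include <stdint.h>

/* Two packed single-precision values: lo = bits [31:0], hi = bits [63:32]. */
typedef struct { float lo, hi; } vec2f;

/* Each 32-bit lane becomes all ones when the operands compare equal and all
   zeros otherwise. The C == operator already treats +0 and -0 as equal,
   which matches note 1 of Table 8. */
static uint64_t pfcmpeq_model(vec2f dst, vec2f src)
{
    uint64_t lo = (dst.lo == src.lo) ? 0xFFFFFFFFu : 0u;
    uint64_t hi = (dst.hi == src.hi) ? 0xFFFFFFFFu : 0u;
    return (hi << 32) | lo;
}
```

Comparing (lo = +0, hi = 1.0) with (lo = -0, hi = 2.0) therefore gives 0000_0000_FFFF_FFFFh: the zero lanes match despite their different signs, and the high lanes do not.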
PFCMPGE

Mnemonic                         Opcode/imm8      Description
PFCMPGE mmreg1, mmreg2/mem64     0Fh 0Fh / 90h    Packed floating-point comparison, greater than or equal to

Privilege: None
Registers affected: MMX
Flags affected: None

PFCMPGE is a vector instruction that performs a comparison of the destination operand and the source operand and generates all one bits or all zero bits based on the result of the corresponding comparison. Table 9 on page 30 shows the numerical range of the PFCMPGE instruction.

The PFCMPGE instruction performs the following operations:

   IF (mmreg1[31:0] >= mmreg2/mem64[31:0])
      THEN mmreg1[31:0] = FFFF_FFFFh
   ELSE mmreg1[31:0] = 0000_0000h

   IF (mmreg1[63:32] >= mmreg2/mem64[63:32])
      THEN mmreg1[63:32] = FFFF_FFFFh
   ELSE mmreg1[63:32] = 0000_0000h

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Related instructions

See the PFCMPEQ instruction.
See the PFCMPGT instruction.

Table 9. Numerical Range for the PFCMPGE Instruction

                                Source 2
Source 1 and destination        0                                 Normal                            Unsupported
0                               FFFF_FFFFh (note 1)               0000_0000h, FFFF_FFFFh (note 2)   Undefined
Normal                          0000_0000h, FFFF_FFFFh (note 3)   0000_0000h, FFFF_FFFFh (note 4)   Undefined
Unsupported                     Undefined                         Undefined                         Undefined

Notes:
1. Positive zero is equal to negative zero.
2. The result is FFFF_FFFFh if source 2 is negative. Otherwise, the result is 0000_0000h.
3. The result is FFFF_FFFFh if source 1 is positive. Otherwise, the result is 0000_0000h.
4. The result is FFFF_FFFFh if source 1 is positive and source 2 is negative, or if they are both negative and source 1 is smaller than or equal in magnitude to source 2, or if source 1 and source 2 are both positive and source 1 is greater than or equal in magnitude to source 2. The result is 0000_0000h in all other cases.
PFCMPGT

Mnemonic                         Opcode/imm8      Description
PFCMPGT mmreg1, mmreg2/mem64     0Fh 0Fh / A0h    Packed floating-point comparison, greater than

Privilege: None
Registers affected: MMX
Flags affected: None

PFCMPGT is a vector instruction that performs a comparison of the destination operand and the source operand and generates all one bits or all zero bits based on the result of the corresponding comparison. Table 10 on page 32 shows the numerical range of the PFCMPGT instruction.

The PFCMPGT instruction performs the following operations:

   IF (mmreg1[31:0] > mmreg2/mem64[31:0])
      THEN mmreg1[31:0] = FFFF_FFFFh
   ELSE mmreg1[31:0] = 0000_0000h

   IF (mmreg1[63:32] > mmreg2/mem64[63:32])
      THEN mmreg1[63:32] = FFFF_FFFFh
   ELSE mmreg1[63:32] = 0000_0000h

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Related instructions

See the PFCMPEQ instruction.
See the PFCMPGE instruction.

Table 10. Numerical Range for the PFCMPGT Instruction

                                Source 2
Source 1 and destination        0                                 Normal                            Unsupported
0                               0000_0000h                        0000_0000h, FFFF_FFFFh (note 1)   Undefined
Normal                          0000_0000h, FFFF_FFFFh (note 2)   0000_0000h, FFFF_FFFFh (note 3)   Undefined
Unsupported                     Undefined                         Undefined                         Undefined

Notes:
1. The result is FFFF_FFFFh if source 2 is negative. Otherwise, the result is 0000_0000h.
2. The result is FFFF_FFFFh if source 1 is positive. Otherwise, the result is 0000_0000h.
3. The result is FFFF_FFFFh if source 1 is positive and source 2 is negative, or if they are both negative and source 1 is smaller in magnitude than source 2, or if source 1 and source 2 are positive and source 1 is greater in magnitude than source 2. The result is 0000_0000h in all other cases.
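Because the comparison instructions return all ones or all zeros per lane, the result can be used as a bit mask to select between two values without a branch, for example with the MMX logical instructions PAND, PANDN, and POR. The C sketch below shows the idea for one lane. The names `bits_of`, `float_of`, `pfcmpgt_lane`, and `select_greater` are hypothetical, and only zero and normal inputs are considered.

```c
#include <stdint.h>
#include <string.h>

static uint32_t bits_of(float f)  { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float float_of(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* One lane of a PFCMPGT-style comparison: all ones if a > b, else all zeros. */
static uint32_t pfcmpgt_lane(float a, float b)
{
    return (a > b) ? 0xFFFFFFFFu : 0x00000000u;
}

/* Branch-free select: keep the bits of a where the mask is set and the bits
   of b elsewhere. This mirrors an AND / AND-NOT / OR sequence on the mask. */
static float select_greater(float a, float b)
{
    uint32_t mask = pfcmpgt_lane(a, b);
    return float_of((bits_of(a) & mask) | (bits_of(b) & ~mask));
}
```

The same pattern works with a PFCMPGE or PFCMPEQ mask; only the condition that produces the mask changes.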
PFMAX

Mnemonic                       Opcode/imm8      Description
PFMAX mmreg1, mmreg2/mem64     0Fh 0Fh / A4h    Packed floating-point maximum

Privilege: None
Registers affected: MMX
Flags affected: None

PFMAX is a vector instruction that returns the larger of the two single-precision, floating-point operands. Any operation with a zero and a negative number returns positive zero. An operation consisting of two zeros returns positive zero. Table 11 on page 34 shows the numerical range of the PFMAX instruction.

The PFMAX instruction performs the following operations:

   IF (mmreg1[31:0] > mmreg2/mem64[31:0])
      THEN mmreg1[31:0] = mmreg1[31:0]
   ELSE mmreg1[31:0] = mmreg2/mem64[31:0]

   IF (mmreg1[63:32] > mmreg2/mem64[63:32])
      THEN mmreg1[63:32] = mmreg1[63:32]
   ELSE mmreg1[63:32] = mmreg2/mem64[63:32]

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Related instructions

See the PFMIN instruction.

Table 11. Numerical Range for the PFMAX Instruction

                                Source 2
Source 1 and destination        0                       Normal                       Unsupported
0                               +0                      Source 2, +0 (note 1)        Undefined
Normal                          Source 1, +0 (note 2)   Source 1/Source 2 (note 3)   Undefined
Unsupported                     Undefined               Undefined                    Undefined

Notes:
1. The result is source 2 if source 2 is positive. Otherwise, the result is positive zero.
2. The result is source 1 if source 1 is positive. Otherwise, the result is positive zero.
3. The result is source 1 if source 1 is positive and source 2 is negative. The result is source 1 if both are positive and source 1 is greater in magnitude than source 2. The result is source 1 if both are negative and source 1 is lesser in magnitude than source 2. The result is source 2 in all other cases.
PFMIN

Mnemonic                       Opcode/imm8      Description
PFMIN mmreg1, mmreg2/mem64     0Fh 0Fh / 94h    Packed floating-point minimum

Privilege: None
Registers affected: MMX
Flags affected: None

PFMIN is a vector instruction that returns the smaller of the two single-precision, floating-point operands. Any operation with a zero and a positive number returns positive zero. An operation consisting of two zeros returns positive zero. Table 12 on page 36 shows the numerical range of the PFMIN instruction.

The PFMIN instruction performs the following operations:

   IF (mmreg1[31:0] < mmreg2/mem64[31:0])
      THEN mmreg1[31:0] = mmreg1[31:0]
   ELSE mmreg1[31:0] = mmreg2/mem64[31:0]

   IF (mmreg1[63:32] < mmreg2/mem64[63:32])
      THEN mmreg1[63:32] = mmreg1[63:32]
   ELSE mmreg1[63:32] = mmreg2/mem64[63:32]

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Related instructions

See the PFMAX instruction.

Table 12. Numerical Range for the PFMIN Instruction

                                Source 2
Source 1 and destination        0                       Normal                       Unsupported
0                               +0                      Source 2, +0 (note 1)        Undefined
Normal                          Source 1, +0 (note 2)   Source 1/Source 2 (note 3)   Undefined
Unsupported                     Undefined               Undefined                    Undefined

Notes:
1. The result is source 2 if source 2 is negative. Otherwise, the result is positive zero.
2. The result is source 1 if source 1 is negative. Otherwise, the result is positive zero.
3. The result is source 1 if source 1 is negative and source 2 is positive. The result is source 1 if both are negative and source 1 is greater in magnitude than source 2. The result is source 1 if both are positive and source 1 is lesser in magnitude than source 2. The result is source 2 in all other cases.
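The PFMAX and PFMIN range tables share one detail that a plain comparison does not capture: whenever the selected value is a zero, the result is positive zero. The following C sketch models one lane of each instruction under that rule. The names `pfmax_lane`, `pfmin_lane`, and `is_positive_zero` are hypothetical, and only zero and normal inputs are considered.

```c
#include <stdint.h>
#include <string.h>

/* Helper for inspecting the sign of a zero result: true only for +0. */
static int is_positive_zero(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u == 0u;
}

/* One lane of a PFMAX-style maximum. A zero result is forced to +0. */
static float pfmax_lane(float a, float b)
{
    float r = (a > b) ? a : b;
    return (r == 0.0f) ? 0.0f : r;
}

/* One lane of a PFMIN-style minimum. A zero result is forced to +0. */
static float pfmin_lane(float a, float b)
{
    float r = (a < b) ? a : b;
    return (r == 0.0f) ? 0.0f : r;
}
```

For instance, `pfmax_lane(-3.0f, -0.0f)` selects the zero operand and returns +0 rather than -0, in line with note 1 of Table 11.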
PFMUL

Mnemonic                       Opcode/imm8      Description
PFMUL mmreg1, mmreg2/mem64     0Fh 0Fh / B4h    Packed floating-point multiplication

Privilege: None
Registers affected: MMX
Flags affected: None

PFMUL is a vector instruction that performs multiplication of the destination operand and the source operand. Both operands are single-precision, floating-point operands with 24-bit significands. Table 13 on page 38 shows the numerical range of the PFMUL instruction.

The PFMUL instruction performs the following operations:

   mmreg1[31:0] = mmreg1[31:0] * mmreg2/mem64[31:0]
   mmreg1[63:32] = mmreg1[63:32] * mmreg2/mem64[63:32]

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
Table 13. Numerical Range for the PFMUL Instruction

                                Source 2
Source 1 and destination        0                 Normal                   Unsupported
0                               +/- 0 (note 1)    +/- 0 (note 1)           +/- 0 (note 1)
Normal                          +/- 0 (note 1)    Normal, +/- 0 (note 2)   Undefined
Unsupported                     +/- 0 (note 1)    Undefined                Undefined

Notes:
1. The sign of the result is the exclusive-OR of the signs of the source operands.
2. If the absolute value of the result is less than 2^-126, the result is zero with the sign being the exclusive-OR of the signs of the source operands. If the absolute value of the product is greater than or equal to 2^128, the result is the largest normal number with the sign being the exclusive-OR of the signs of the source operands.
PFRCP

Mnemonic                       Opcode/imm8      Description
PFRCP mmreg1, mmreg2/mem64     0Fh 0Fh / 96h    Floating-point reciprocal approximation

Privilege: None
Registers affected: MMX
Flags affected: None

PFRCP is a scalar instruction that returns a low-precision estimate of the reciprocal of the source operand. The single result value is duplicated in both the high and low halves of this instruction's 64-bit result. The source operand is single-precision with a 24-bit significand, and the result is accurate to 14 bits. Table 14 on page 40 shows the numerical range of the PFRCP instruction.

Increased accuracy (the full 24 bits of a single-precision significand) requires the use of two additional instructions (PFRCPIT1 and PFRCPIT2). The first stage of this refinement in accuracy (PFRCPIT1) requires that the input and output of the already executed PFRCP instruction be used as input to the PFRCPIT1 instruction. Refer to "Division and Square Root" on page 59 for an application-specific example of how to use this instruction and related instructions.

The PFRCP instruction performs the following operations:

   mmreg1[31:0] = reciprocal(mmreg2/mem64[31:0])
   mmreg1[63:32] = reciprocal(mmreg2/mem64[31:0])

Exceptions generated:

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
40 3dnow!? instruction set chapter 2 3dnow!? technology manual 21928g/0march 2000 in the following code example, the bold line illustrates the pfrcp instruction in a sequence used to compute q = a/b accurate to 24 bits: x 0 = pfrcp(b) x 1 = pfrcpit1(b,x 0 ) x 2 = pfrcpit2(x 1 ,x 0 ) q = pfmul(a,x 2 ) related instructions see the pfrcpit1 instruction. see the pfrcpit2 instruction. table 14. numerical range for the pfrcp instruction source 1 and destination source 2 0 +/? maximum normal 1 normal normal, +/? 0 2 unsupported undefined notes: 1. the result has the same sign as the source operand. 2. if the absolute value of the result is less then 2 C126 , the result is zero with the sign being the sign of the source operand. otherwise, the result is a normal with the sign being the same sign as the source operand.
PFRCPIT1

  Mnemonic                         Opcode/imm8      Description
  PFRCPIT1 mmreg1, mmreg2/mem64    0Fh 0Fh / A6h    Packed floating-point reciprocal, first iteration step

Privilege: None
Registers affected: MMX
Flags affected: None

PFRCPIT1 is a vector instruction that performs the first intermediate step in the Newton-Raphson iteration to refine the reciprocal approximation produced by the PFRCP instruction (the second and final step completes the iteration and is accurate to 24 bits). Table 15 shows the numerical range of the PFRCPIT1 instruction.

The behavior of this instruction is only defined for those combinations of operands such that one source operand was the input to the PFRCP instruction and the other source operand was the output of the same PFRCP instruction. Refer to Chapter 3, Division and Square Root, for an application-specific example of how to use this instruction and related instructions.

Exceptions generated:

  Exception                              Real  Virtual 8086  Protected  Description
  Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
  Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
  Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
  General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
  Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
  Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
  Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
  Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)

In the following code example, the second line illustrates the PFRCPIT1 instruction in a sequence used to compute q = a/b accurate to 24 bits:

  X0 = PFRCP(b)
  X1 = PFRCPIT1(b, X0)
  X2 = PFRCPIT2(X1, X0)
  q  = PFMUL(a, X2)

Related instructions
  See the PFRCP instruction.
  See the PFRCPIT2 instruction.

Table 15. Numerical Range for the PFRCPIT1 Instruction

                                Source 2
  Source 1 and destination      0           Normal        Unsupported
  0                             +/-0 (1)    +/-0 (1)      +/-0 (1)
  Normal                        +/-0 (1)    Normal (2)    Undefined
  Unsupported                   +/-0 (1)    Undefined     Undefined

Notes:
  1. The sign of the result is the exclusive-OR of the signs of the source operands.
  2. The sign is positive.
PFRCPIT2

  Mnemonic                         Opcode/imm8      Description
  PFRCPIT2 mmreg1, mmreg2/mem64    0Fh 0Fh / B6h    Packed floating-point reciprocal/reciprocal square root, second iteration step

Privilege: None
Registers affected: MMX
Flags affected: None

PFRCPIT2 is a vector instruction that performs the second and final intermediate step in the Newton-Raphson iteration to refine the reciprocal or reciprocal square root approximation produced by the PFRCP and PFRSQRT instructions, respectively. Table 16 shows the numerical range of the PFRCPIT2 instruction.

The behavior of this instruction is only defined for those combinations of operands such that the first source operand (mmreg1) was the output of either the PFRCPIT1 or PFRSQIT1 instruction and the second source operand (mmreg2/mem64) was the output of either the PFRCP or PFRSQRT instruction. Refer to Chapter 3, Division and Square Root, for an application-specific example of how to use this instruction and related instructions.

Exceptions generated:

  Exception                              Real  Virtual 8086  Protected  Description
  Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
  Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
  Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
  General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
  Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
  Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
  Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
  Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)

In the following code example, the third line illustrates the PFRCPIT2 instruction in a sequence used to compute q = a/b accurate to 24 bits:

  X0 = PFRCP(b)
  X1 = PFRCPIT1(b, X0)
  X2 = PFRCPIT2(X1, X0)
  q  = PFMUL(a, X2)

Related instructions
  See the PFRCPIT1 instruction.
  See the PFRSQIT1 instruction.
  See the PFRCP instruction.
  See the PFRSQRT instruction.

Table 16. Numerical Range for the PFRCPIT2 Instruction

                                Source 2
  Source 1 and destination      0           Normal              Unsupported
  0                             +/-0 (1)    +/-0 (1)            +/-0 (1)
  Normal                        +/-0 (1)    Normal, +/-0 (2)    Undefined
  Unsupported                   +/-0 (1)    Undefined           Undefined

Notes:
  1. The sign of the result is the exclusive-OR of the signs of the source operands.
  2. If the absolute value of the result is less than 2^-126, the result is zero with the sign being the exclusive-OR of the signs of the source operands. If the absolute value of the product is greater than or equal to 2^128, the result is the largest normal number with the sign being the exclusive-OR of the signs of the source operands.
PFRSQIT1

  Mnemonic                         Opcode/imm8      Description
  PFRSQIT1 mmreg1, mmreg2/mem64    0Fh 0Fh / A7h    Packed floating-point reciprocal square root, first iteration step

Privilege: None
Registers affected: MMX
Flags affected: None

PFRSQIT1 is a vector instruction that performs the first intermediate step in the Newton-Raphson iteration to refine the reciprocal square root approximation produced by the PFRSQRT instruction (the second and final step completes the iteration and is accurate to 24 bits). Table 17 shows the numerical range of the PFRSQIT1 instruction.

The behavior of this instruction is only defined for those combinations of operands such that one source operand was the input to the PFRSQRT instruction and the other source operand is the square of the output of the same PFRSQRT instruction. Refer to Chapter 3, Division and Square Root, for an application-specific example of how to use this instruction and related instructions.

Exceptions generated:

  Exception                              Real  Virtual 8086  Protected  Description
  Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
  Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
  Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
  General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
  Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
  Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
  Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
  Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)

In the following code example, the second and third lines illustrate the PFMUL and PFRSQIT1 instructions in a sequence used to compute a = 1/sqrt(b) accurate to 24 bits:

  X0 = PFRSQRT(b)
  X1 = PFMUL(X0, X0)
  X2 = PFRSQIT1(b, X1)
  a  = PFRCPIT2(X2, X0)

Related instructions
  See the PFRCPIT2 instruction.
  See the PFRSQRT instruction.

Table 17. Numerical Range for the PFRSQIT1 Instruction

                                Source 2
  Source 1 and destination      0           Normal        Unsupported
  0                             +/-0 (1)    +/-0 (1)      +/-0 (1)
  Normal                        +/-0 (1)    Normal (2)    Undefined
  Unsupported                   +/-0 (1)    Undefined     Undefined

Notes:
  1. The sign of the result is the exclusive-OR of the signs of the source operands.
  2. The sign is 0.
PFRSQRT

  Mnemonic                        Opcode/imm8      Description
  PFRSQRT mmreg1, mmreg2/mem64    0Fh 0Fh / 97h    Floating-point reciprocal square root approximation

Privilege: None
Registers affected: MMX
Flags affected: None

PFRSQRT is a scalar instruction that returns a low-precision estimate of the reciprocal square root of the source operand. The single result value is duplicated in both the high and low halves of this instruction's 64-bit result. The source operand is single-precision with a 24-bit significand, and the result is accurate to 15 bits. Negative operands are treated as positive operands for purposes of reciprocal square root computation, with the sign of the result the same as the sign of the source operand. Table 18 shows the numerical range of the PFRSQRT instruction.

Increased accuracy (the full 24 bits of a single-precision significand) requires the use of two additional instructions (PFRSQIT1 and PFRCPIT2). The first stage of this refinement in accuracy (PFRSQIT1) requires that the input and squared output of the already executed PFRSQRT instruction be used as input to the PFRSQIT1 instruction. Refer to Chapter 3, Division and Square Root, for an application-specific example of how to use this instruction and related instructions.

The PFRSQRT instruction performs the following operations:

  mmreg1[31:0]  = reciprocal square root(mmreg2/mem64[31:0])
  mmreg1[63:32] = reciprocal square root(mmreg2/mem64[31:0])

Exceptions generated:

  Exception                              Real  Virtual 8086  Protected  Description
  Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
  Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
  Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
  General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
  Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
  Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
  Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
  Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)

In the following code example, the first line illustrates the PFRSQRT instruction in a sequence used to compute a = 1/sqrt(b) accurate to 24 bits:

  X0 = PFRSQRT(b)
  X1 = PFMUL(X0, X0)
  X2 = PFRSQIT1(b, X1)
  a  = PFRCPIT2(X2, X0)

Related instructions
  See the PFRSQIT1 instruction.
  See the PFRCPIT2 instruction.

Table 18. Numerical Range for the PFRSQRT Instruction

  Source 2       Source 1 and destination
  0              +/- Maximum normal *
  Normal         Normal *
  Unsupported    Undefined *

Note:
  * The result has the same sign as the source operand.
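The sign and zero handling described for PFRSQRT can be summarized in a short software model. This Python sketch is illustrative only: `pfrsqrt_model` is a hypothetical name, it uses full-precision arithmetic as a stand-in for the 15-bit hardware estimate, and it does not model unsupported operands.

```python
import math

MAX_NORMAL = (2.0 - 2.0**-23) * 2.0**127   # largest single-precision normal

def pfrsqrt_model(low_lane: float):
    """Return the (high, low) result lanes for the low source lane."""
    if low_lane == 0.0:
        # Table 18: a zero source yields +/- maximum normal, sign of the source.
        value = math.copysign(MAX_NORMAL, low_lane)
    else:
        # Negative operands are treated as positive; the source sign is kept.
        estimate = 1.0 / math.sqrt(abs(low_lane))   # stand-in for the estimate
        value = math.copysign(estimate, low_lane)
    # The single scalar result is duplicated in both halves of the register.
    return (value, value)
```

Only the low 32 bits of the source take part, which is why the model accepts a single value and returns the same result twice.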
PFSUB

  Mnemonic                      Opcode/imm8      Description
  PFSUB mmreg1, mmreg2/mem64    0Fh 0Fh / 9Ah    Packed floating-point subtraction

Privilege: None
Registers affected: MMX
Flags affected: None

PFSUB is a vector instruction that performs subtraction of the source operand from the destination operand. Both operands are single-precision, floating-point operands with 24-bit significands. Table 19 shows the numerical range of the PFSUB instruction.

The PFSUB instruction performs the following operations:

  mmreg1[31:0]  = mmreg1[31:0]  - mmreg2/mem64[31:0]
  mmreg1[63:32] = mmreg1[63:32] - mmreg2/mem64[63:32]

Exceptions generated:

  Exception                              Real  Virtual 8086  Protected  Description
  Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
  Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
  Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
  General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
  Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
  Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
  Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
  Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)

Related instructions
  See the PFSUBR instruction.

Table 19. Numerical Range for the PFSUB Instruction

                                Source 2
  Source 1 and destination      0           Normal              Unsupported
  0                             +/-0 (1)    Source 2            Source 2
  Normal                        Source 1    Normal, +/-0 (2)    Undefined
  Unsupported                   Source 1    Undefined           Undefined

Notes:
  1. The sign of the result is the logical AND of the sign of source 1 and the inverse of the sign of source 2.
  2. If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand that is larger in magnitude (if the magnitudes are equal, the sign of source 1 is used). If the absolute value of the result is greater than or equal to 2^128, the result is the largest normal number with the sign being the sign of the source operand that is larger in magnitude.
PFSUBR

  Mnemonic                       Opcode/imm8      Description
  PFSUBR mmreg1, mmreg2/mem64    0Fh 0Fh / AAh    Packed floating-point reverse subtraction

Privilege: None
Registers affected: MMX
Flags affected: None

PFSUBR is a vector instruction that performs subtraction of the destination operand from the source operand. Both operands are single-precision, floating-point operands with 24-bit significands. Table 20 shows the numerical range of the PFSUBR instruction.

The PFSUBR instruction performs the following operations:

  mmreg1[31:0]  = mmreg2/mem64[31:0]  - mmreg1[31:0]
  mmreg1[63:32] = mmreg2/mem64[63:32] - mmreg1[63:32]

Exceptions generated:

  Exception                              Real  Virtual 8086  Protected  Description
  Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
  Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
  Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
  General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
  Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
  Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
  Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
  Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)

Related instructions
  See the PFSUB instruction.

Table 20. Numerical Range for the PFSUBR Instruction

                                Source 2
  Source 1 and destination      0           Normal              Unsupported
  0                             +/-0 (1)    Source 2            Source 2
  Normal                        Source 1    Normal, +/-0 (2)    Undefined
  Unsupported                   Source 1    Undefined           Undefined

Notes:
  1. The sign of the result is the logical AND of the sign of source 1 and the inverse of the sign of source 2.
  2. If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand that is larger in magnitude (if the magnitudes are equal, the sign of source 2 is used). If the absolute value of the result is greater than or equal to 2^128, the result is the largest normal number with the sign being the sign of the source operand that is larger in magnitude.
PI2FD

  Mnemonic                      Opcode/imm8      Description
  PI2FD mmreg1, mmreg2/mem64    0Fh 0Fh / 0Dh    Packed 32-bit integer to floating-point conversion

Privilege: None
Registers affected: MMX
Flags affected: None

PI2FD is a vector instruction that converts a vector register containing signed, 32-bit integers to single-precision, floating-point operands. When PI2FD converts an input operand with more significant digits than are available in the output, the output is truncated.

The PI2FD instruction performs the following operations:

  mmreg1[31:0]  = float(mmreg2/mem64[31:0])
  mmreg1[63:32] = float(mmreg2/mem64[63:32])

Related instructions
  See the PF2ID instruction.

Exceptions generated:

  Exception                              Real  Virtual 8086  Protected  Description
  Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
  Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
  Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
  General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
  Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
  Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
  Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
  Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)
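The truncation behavior of PI2FD matters for integers that need more than 24 significant bits. The Python sketch below models one lane of the conversion; `pi2fd_lane` is a hypothetical name, and the sketch assumes that "truncated" means the excess low-order bits of the magnitude are discarded (rounding toward zero), which the text above does not spell out.

```python
import struct

def pi2fd_lane(value: int) -> float:
    """Convert a signed 32-bit integer to single precision by truncation."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value must fit in a signed 32-bit integer")
    magnitude = abs(value)
    # Bits beyond the 24-bit single-precision significand.
    excess = max(magnitude.bit_length() - 24, 0)
    # Assumption: the excess bits are dropped, not rounded.
    truncated = (magnitude >> excess) << excess
    result = float(truncated)
    # Sanity check: the truncated value is exactly representable in float32.
    assert struct.unpack("<f", struct.pack("<f", result))[0] == result
    return -result if value < 0 else result
```

Under this assumption, `pi2fd_lane(16777217)` (2^24 + 1) gives 16777216.0, whereas a round-to-nearest conversion of 2^24 + 3 would give 16777220.0 instead of the truncated 16777218.0.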
PMULHRW

  Mnemonic                        Opcode/imm8      Description
  PMULHRW mmreg1, mmreg2/mem64    0Fh 0Fh / B7h    Multiply signed packed 16-bit values with rounding and store the high 16 bits

Privilege: None
Registers affected: MMX
Flags affected: None

The PMULHRW instruction multiplies the four signed 16-bit integer values in the source operand (an MMX register or a 64-bit memory location) by the four corresponding signed 16-bit integer values in the destination operand (an MMX register). The PMULHRW instruction then adds 8000h to the lower 16 bits of the 32-bit result, which results in the rounding of the high-order, 16-bit result. The high-order 16 bits of the result (including the sign bit) are stored in the destination operand.

The PMULHRW instruction provides a numerically more accurate result than the PMULHW instruction, which truncates the result instead of rounding.

Exceptions generated:

  Exception                              Real  Virtual 8086  Protected  Description
  Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
  Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
  Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
  General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
  Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
  Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
  Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
  Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In protected mode, CPL = 3.)

Functional illustration of the PMULHRW instruction

[Figure: each 16-bit word of one operand is multiplied by the corresponding word of the other, and the rounded high-order word is written to mmreg1 (bits 63 to 0). Word pairs and results: D250h * 8807h = 1569h; 5321h * EC22h = F98Ch; 7007h * 7FFEh = 3803h (rounded up); FFFFh * FFFFh = 0000h.]

The following list explains the functional illustration of the PMULHRW instruction:

- The signed 16-bit negative value D250h (-2DB0h) is multiplied by the signed 16-bit negative value 8807h (-77F9h) to produce the signed 32-bit positive result of 1569_4030h. 8000h is then added to the lower 16 bits to produce a final result of 1569_C030h. This rounding does not affect the final result of 1569h. The signed high-order 16 bits of the result are stored in the destination operand.
- The signed 16-bit positive value 5321h is multiplied by the signed 16-bit negative value EC22h (-13DEh) to produce the signed 32-bit negative result of F98C_7662h (-0673_899Eh). 8000h is then added to the lower 16 bits, producing a final result of F98C_F662h. This rounding does not affect the final result of F98Ch. The signed high-order 16 bits of the result are stored in the destination operand.
- The signed 16-bit positive value 7007h is multiplied by the signed 16-bit positive value 7FFEh to produce the signed 32-bit positive result of 3802_9FF2h. 8000h is then added to the lower 16 bits to produce a final result of 3803_1FF2h. This result has been rounded up. The signed high-order 16 bits of the result (3803h) are stored in the destination operand.
- The signed 16-bit negative value FFFFh (-1) is multiplied by the signed 16-bit negative value FFFFh (-1) to produce the signed 32-bit positive result of 0000_0001h. 8000h is then added to the lower 16 bits to produce a final result of 0000_8001h. This rounding does not affect the final result of 0000h. The signed high-order 16 bits of the result are stored in the destination operand.
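The four worked cases above can be checked with a small software model. This Python sketch is an illustration rather than AMD code; the helper names `to_signed16`, `pmulhrw_lane`, and `pmulhrw` are hypothetical, and each operand is represented as a list of four 16-bit words.

```python
def to_signed16(word: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return word - 0x10000 if word & 0x8000 else word

def pmulhrw_lane(a: int, b: int) -> int:
    """Model one 16-bit lane: multiply, add 8000h, keep the high word."""
    product = to_signed16(a) * to_signed16(b)    # signed 32-bit product
    rounded = (product + 0x8000) & 0xFFFFFFFF    # add 8000h, wrap to 32 bits
    return rounded >> 16                         # high-order 16 bits

def pmulhrw(dst, src):
    """Apply the lane model to four corresponding 16-bit words."""
    return [pmulhrw_lane(d, s) for d, s in zip(dst, src)]

results = pmulhrw([0xD250, 0x5321, 0x7007, 0xFFFF],
                  [0x8807, 0xEC22, 0x7FFE, 0xFFFF])
print([f"{r:04X}h" for r in results])
```

Only the third pair (7007h and 7FFEh) is changed by the rounding step; dropping the `+ 0x8000` term would give 3802h for that lane, which is the truncating behavior the text contrasts with.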
PREFETCH/PREFETCHW

  Mnemonic            Opcode       Description
  PREFETCH(W) mem8    0Fh 0Dh      Prefetch processor cache line into L1 data cache (Dcache)

Privilege: None
Registers affected: None
Flags affected: None
Exceptions generated: None

The PREFETCH instruction loads a processor cache line into the data cache. The address of this line is specified by the mem8 value. For the AMD processor, the line size is 32 bytes. In all future processors, the size of the line that is loaded by the PREFETCH instruction will be at least 32 bytes. The PREFETCH instruction loads a cache line even if the mem8 address is not aligned with the start of the line (although some implementations, including the AMD-K6 family of processors, may perform the cache fill starting from the cache miss or mem8 address). If a cache hit occurs (the line is already in the Dcache) or a memory fault is detected, no bus cycle is initiated and the instruction is treated as a NOP.

In applications where a large number of data sets must be processed, the PREFETCH instruction can pre-load the next data set into the Dcache while the processor is operating on the present set of data. This instruction allows the programmer to explicitly code operation concurrency. When processing of the present set of data values is completed, the next set is already available in the Dcache. An example of a concurrent operation is vertex processing in 3D transformations, where the next set of vertices can be prefetched into the data cache while the present set is being transformed.

The PREFETCH instruction format in the processor is defined to allow extensions in future AMD K86 processors. The encoding of the PREFETCH instruction includes the ModR/M byte. Only the memory form of ModR/M is valid (use of the register form results in an invalid opcode exception). Because there is no destination register, the three destination register field bits of the ModR/M byte are used to define the type of prefetch to be performed. The PREFETCH and PREFETCHW instructions are defined by the bit patterns 000b and 001b, respectively. All other bit patterns are reserved for future use.

The PREFETCHW instruction loads the prefetched line and sets the cache line MESI state to modified (in anticipation of subsequent data writes to the line), unlike the PREFETCH instruction, which typically sets the state to exclusive. If the data that is prefetched into the Dcache is to be modified, use of the PREFETCHW instruction saves the cycle that would otherwise be required after a PREFETCH instruction to modify the Dcache line state. The PREFETCHW instruction should be used when the programmer expects that the data in the cache line will be modified. Otherwise, the PREFETCH instruction should be used.

Note: The AMD-K6-2 and AMD-K6-III processors execute the PREFETCHW instruction identically to the PREFETCH instruction. However, the AMD Athlon and future AMD processors that support PREFETCHW as described above will be able to take advantage of the performance benefit provided by this instruction. For more information, see the AMD Athlon Processor x86 Code Optimization Guide, order# 22007.

Table 21 summarizes the PREFETCH type options.

Table 21. Summary of PREFETCH Instruction Type Options

  Mod R/M       Result
  11-xxx-xxx    Invalid opcode
  mm-000-xxx    PREFETCH
  mm-001-xxx    PREFETCHW
  mm-010-xxx    Reserved
  mm-011-xxx    Reserved
  mm-100-xxx    Reserved
  mm-101-xxx    Reserved
  mm-110-xxx    Reserved
  mm-111-xxx    Reserved

Note: The reserved PREFETCH types do not result in an invalid opcode exception if executed. Instead, for forward compatibility with future processors that may implement additional forms of the PREFETCH instruction, all reserved PREFETCH types are implemented as synonyms for the basic PREFETCH type (that is, the PREFETCH instruction with type 000b).
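The decoding rules of Table 21 and the note on reserved types can be expressed as a small lookup. The following Python sketch is a reading aid under the stated rules, not a description of the hardware decoder; the function name `prefetch_type` and its string return values are hypothetical.

```python
def prefetch_type(modrm: int) -> str:
    """Classify the ModR/M byte that follows the 0Fh 0Dh opcode (Table 21)."""
    mod = (modrm >> 6) & 0b11    # addressing mode field (bits 7:6)
    reg = (modrm >> 3) & 0b111   # destination register field (bits 5:3)
    if mod == 0b11:
        # The register form of ModR/M is not valid for this instruction.
        return "invalid opcode"
    if reg == 0b001:
        return "PREFETCHW"
    # 000b is PREFETCH; reserved types 010b to 111b behave as synonyms for it.
    return "PREFETCH"
```

For instance, a ModR/M byte of 08h (mod = 00b, type field = 001b) is classified as PREFETCHW, while 10h (type field = 010b, reserved) falls back to PREFETCH.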
3 Division and Square Root

Division

The 3DNow! instructions can be used to compute a very fast, highly accurate reciprocal or quotient. Consider the quotient q = a/b. An on-chip, ROM-based table lookup can be used to quickly produce a 14- to 15-bit precision approximation of 1/b (using just one two-cycle latency instruction, PFRCP). A full-precision reciprocal can then quickly be computed from this approximation using a Newton-Raphson algorithm.

The general Newton-Raphson recurrence for the reciprocal is as follows:

  $z_{i+1} \leftarrow z_i \cdot (2 - b \cdot z_i)$

Given that the initial approximation is accurate to at least 14 bits, and that full IEEE single precision contains 24 bits of mantissa, just one Newton-Raphson iteration is required. The following shows the 3DNow! instruction sequence to produce the initial reciprocal approximation, to compute the full-precision reciprocal from this, and lastly, to complete the required division of a/b.

  X0 = PFRCP(b)
  X1 = PFRCPIT1(b, X0)
  X2 = PFRCPIT2(X1, X0)
  q  = PFMUL(a, X2)

The 24-bit final reciprocal value is X2. In the AMD processor implementation, the estimate contains the correct round-to-nearest value for approximately 99% of all arguments. The remaining arguments differ from the correct round-to-nearest value for the reciprocal by 1 unit-in-the-last-place (ulp). The quotient is formed in the last step by multiplying the reciprocal by the dividend a.

Divide examples

These examples illustrate the use of 3DNow! instructions to perform divides.

(14-bit precision)

  movd      mm0, [mem]   ; 0   | w
  pfrcp     mm0, mm0     ; 1/w | 1/w (approx.)
  movq      mm2, [mem]   ; y   | x
  pfmul     mm2, mm0     ; y/w | x/w

(24-bit precision)

  movd      mm0, [mem]   ; 0   | w
  pfrcp     mm1, mm0     ; 1/w | 1/w (approx.)
  punpckldq mm0, mm0     ; w   | w   (MMX instruction)
  pfrcpit1  mm0, mm1     ; 1/w | 1/w (intermed.)
  movq      mm2, [mem]   ; y   | x
  pfrcpit2  mm0, mm1     ; 1/w | 1/w (full prec.)
  pfmul     mm2, mm0     ; y/w | x/w

Note: For a description of the PUNPCKLDQ instruction, see the AMD-K6 Processor Multimedia Technology Manual, order# 20726.
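The claim that a single Newton-Raphson iteration suffices can be checked numerically. The Python sketch below applies the reciprocal recurrence in ordinary floating-point arithmetic; it does not reproduce the intermediate values of PFRCPIT1 and PFRCPIT2. The function names are hypothetical, and the ROM table lookup is modelled, as an assumption, by truncating the exact reciprocal to 14 significant bits.

```python
import math

def truncate_bits(x: float, bits: int) -> float:
    """Keep only the leading `bits` significant bits of x (assumed table model)."""
    mantissa, exponent = math.frexp(x)       # x = mantissa * 2**exponent
    scale = 2.0 ** bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def refine_reciprocal(b: float, z: float) -> float:
    """One Newton-Raphson step: z * (2 - b * z)."""
    return z * (2.0 - b * z)

def divide(a: float, b: float, estimate_bits: int = 14) -> float:
    """Compute a/b from a low-precision reciprocal estimate of b."""
    z0 = truncate_bits(1.0 / b, estimate_bits)   # stand-in for PFRCP
    z1 = refine_reciprocal(b, z0)                # one refinement iteration
    return a * z1                                # final multiply, as with PFMUL

b = 3.0
z0 = truncate_bits(1.0 / b, 14)
print(abs(z0 * b - 1.0), abs(refine_reciprocal(b, z0) * b - 1.0))
```

If the estimate has relative error e, one iteration leaves a relative error of e squared, so an estimate good to roughly 2^-13 or better ends below 2^-24, which is why no second iteration is needed for single precision.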
Square Root

The 3DNow! instructions can also be used to compute a reciprocal square root or square root with high performance. The general Newton-Raphson reciprocal square root recurrence is as follows:

  $z_{i+1} \leftarrow \frac{1}{2} \cdot z_i \cdot (3 - b \cdot z_i^2)$

To reduce the number of iterations, the initial approximation is read from a table. The 3DNow! reciprocal square root approximation is accurate to at least 15 bits. Accordingly, to obtain a single-precision 24-bit reciprocal square root of an input operand b, one Newton-Raphson iteration is required using the following 3DNow! instructions:

  1. X0 = PFRSQRT(b)
  2. X1 = PFMUL(X0, X0)
  3. X2 = PFRSQIT1(b, X1)
  4. X3 = PFRCPIT2(X2, X0)
  5. X4 = PFMUL(b, X3)

The 24-bit final reciprocal square root value is X3. In the AMD implementation, the estimate contains the correct round-to-nearest value for approximately 87% of all arguments. The remaining arguments differ from the correct round-to-nearest value by 1 ulp. The square root (X4) is formed in the last step by multiplying by the input operand b.

Square root examples

These examples illustrate the use of 3DNow! technology to perform square roots.

(15-bit precision)

  movd      mm0, [mem]   ; 0          | a
  pfrsqrt   mm1, mm0     ; 1/(sqrt a) | 1/(sqrt a) (approx.)
  punpckldq mm0, mm0     ; a          | a          (MMX instr.)
  pfmul     mm0, mm1     ; (sqrt a)   | (sqrt a)

(24-bit precision)

  movd      mm0, [mem]   ; 0          | a
  pfrsqrt   mm1, mm0     ; 1/(sqrt a) | 1/(sqrt a) (approx.)
  movq      mm2, mm1     ; X_0 = 1/(sqrt a) (approx.)
  pfmul     mm1, mm1     ; X_0 * X_0  | X_0 * X_0  step 1
  punpckldq mm0, mm0     ; a          | a          (MMX instr.)
  pfrsqit1  mm1, mm0     ; (intermediate)          step 2
  pfrcpit2  mm1, mm2     ; 1/(sqrt a) (full prec.) step 3
  pfmul     mm0, mm1     ; (sqrt a)   | (sqrt a)
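The reciprocal square root recurrence can be exercised the same way as the reciprocal recurrence. This Python sketch evaluates the recurrence directly and does not mirror how the work is split between PFMUL, PFRSQIT1, and PFRCPIT2. The function names are hypothetical, and the table lookup is modelled, as an assumption, by truncating the exact value to 15 significant bits.

```python
import math

def truncate_bits(x: float, bits: int) -> float:
    """Keep only the leading `bits` significant bits of x (assumed table model)."""
    mantissa, exponent = math.frexp(x)       # x = mantissa * 2**exponent
    scale = 2.0 ** bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def refine_rsqrt(b: float, z: float) -> float:
    """One Newton-Raphson step: 1/2 * z * (3 - b * z**2)."""
    return 0.5 * z * (3.0 - b * z * z)

def sqrt_via_rsqrt(b: float, estimate_bits: int = 15) -> float:
    """Compute sqrt(b) as b * (1/sqrt(b)) from a low-precision estimate."""
    z0 = truncate_bits(1.0 / math.sqrt(b), estimate_bits)  # stand-in for PFRSQRT
    z1 = refine_rsqrt(b, z0)                               # one refinement
    return b * z1                                          # final multiply by b

print(sqrt_via_rsqrt(2.0), math.sqrt(2.0))
```

With an estimate of relative error e, one step leaves a relative error of about 1.5 times e squared, so a 15-bit estimate lands comfortably within single precision after one iteration.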

